Deep-Learning Tricks as Seen in the Transformer Implementation

Introduction

tensor2tensor implements many commonly used models, most notably the Transformer.

Breakdown

  • dropout
  • norm_epsilon
  • moe
  • sampling_method="argmax"  # "argmax" or "random"
  • use_target_space_embedding
  • conv_first_kernel

dropout

  • Where is dropout applied, and why?
  • What dropout rate?
"dropout": 0.2,
"relu_dropout": 0.1,
"relu_dropout_broadcast_dims": "",
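
A minimal sketch of where these two rates act (illustrative NumPy, not the t2t internals): `dropout` is the residual dropout applied to each sub-layer's output before the residual add, and `relu_dropout` acts inside the feed-forward layer right after the ReLU:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, train=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def feed_forward(x, w1, w2, relu_dropout=0.1):
    """Position-wise FFN; "relu_dropout": 0.1 is applied after the ReLU."""
    h = np.maximum(x @ w1, 0.0)        # ReLU
    h = dropout(h, relu_dropout)
    return h @ w2

def residual_block(x, sublayer_out, residual_dropout=0.2):
    """Residual dropout ("dropout": 0.2) on the sub-layer output before the add."""
    return x + dropout(sublayer_out, residual_dropout)
```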

LayerNorm

Each sub-layer is wrapped as LayerNorm(x + Sublayer(x)): a residual connection, followed by layer normalization applied after the residual add (post-LN).
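
The post-LN wiring can be sketched as follows (simplified NumPy; `sublayer` stands in for attention or the FFN, and `epsilon` plays the role of norm_epsilon):

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize over the last dimension; epsilon guards the division."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def post_ln_block(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual connection, then layer norm."""
    return layer_norm(x + sublayer(x))
```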

share vocab

share the vocabulary between source and target language

The corresponding parameter:

"shared_embedding": false

Sharing the vocabulary:

  • Advantages:
    • Proper nouns such as person and place names can be copied over directly.
  • Drawbacks:
    • A larger vocab_size increases the cost of the softmax.
    • Source and target can get mixed up; e.g. in English-Chinese translation, the output may contain stray English tokens.

Choice of activation function

moe

"model_dir": "./t2t_train/languagemodel_lm1b32k/transformer-transformer_small",
"moe_hidden_sizes": "2048",
"moe_k": 2,
"moe_loss_coef": 0.001,
"moe_num_experts": 16,
"moe_overhead_eval": 2.0,
"moe_overhead_train": 1.0,
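
The gate picks moe_k = 2 of the moe_num_experts = 16 experts per token. A simplified sketch of top-k gating (illustrative NumPy; the real t2t gate also adds noise and a load-balancing auxiliary loss scaled by moe_loss_coef):

```python
import numpy as np

def top_k_gating(logits, k=2):
    """Keep the top-k gate logits per token (moe_k of moe_num_experts),
    softmax over just those; all other experts get weight 0."""
    topk = np.argsort(logits, axis=-1)[..., -k:]
    mask = np.zeros_like(logits, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=-1)
    masked = np.where(mask, logits, -np.inf)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def moe_layer(x, expert_weights, gate_w, k=2):
    """Mix the selected experts' outputs by their gate weights."""
    gates = top_k_gating(x @ gate_w, k=k)                   # [tokens, experts]
    expert_out = np.stack([x @ w for w in expert_weights])  # [experts, tokens, d]
    return np.einsum("te,etd->td", gates, expert_out)
```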

tie embedding

# Nematus configuration: https://github.com/EdinburghNLP/nematus#network-parameters
tie_decoder_embeddings: tie the input embeddings of the decoder with the softmax output embeddings
tie_encoder_decoder_embeddings: tie the input embeddings of the encoder and the decoder (first factor only). Source and target vocabulary size must be the same

share the weights between softmax and embedding

The corresponding parameter:

shared_embedding_and_softmax_weights = True
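
A hedged sketch of what the tying means (names are illustrative, not the t2t internals): one matrix serves both as the input embedding table and, transposed, as the softmax output projection:

```python
import numpy as np

vocab_size, d_model = 1000, 64
rng = np.random.default_rng(0)
# single shared matrix (scaled init, an illustrative choice)
embedding = rng.normal(size=(vocab_size, d_model)) * d_model ** -0.5

def embed(token_ids):
    """Input embedding: look rows up in the shared matrix."""
    return embedding[token_ids]

def output_logits(hidden):
    """Softmax projection reuses the same matrix, transposed
    (shared_embedding_and_softmax_weights = True)."""
    return hidden @ embedding.T
```

This roughly halves the embedding parameter count and keeps input and output representations in the same space.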

Source
This trick applies only when the source and target share the same vocabulary.

Learning rate

"learning_rate": 0.2,
"learning_rate_constant": 2.0,
"learning_rate_cosine_cycle_steps": 250000,
"learning_rate_decay_rate": 1.0,
"learning_rate_decay_scheme": "noam",
"learning_rate_decay_staircase": false,
"learning_rate_decay_steps": 5000,
"learning_rate_minimum": null,
"learning_rate_schedule": "constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size",
"learning_rate_warmup_steps": 8000,

Learning-rate curve
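
The "noam" scheme composes the schedule string "constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size": the rate rises linearly for learning_rate_warmup_steps, then decays as 1/sqrt(step), scaled by hidden_size**-0.5. A sketch (hidden_size=512 is an assumed model dimension, not one of the schedule hparams above):

```python
def noam_lr(step, constant=2.0, warmup_steps=8000, hidden_size=512):
    """constant * linear_warmup * rsqrt_decay * rsqrt_hidden_size:
    linear warmup to a peak at warmup_steps, then 1/sqrt(step) decay."""
    step = max(step, 1)
    linear_warmup = min(1.0, step / warmup_steps)
    rsqrt_decay = max(step, warmup_steps) ** -0.5
    return constant * linear_warmup * rsqrt_decay * hidden_size ** -0.5
```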

Weights

"weight_decay": 0.0,
"weight_dtype": "float32",
"weight_noise": 0.0

optimizer

"optimizer": "Adam",
"optimizer_adafactor_beta1": 0.0,
"optimizer_adafactor_beta2": 0.999,
"optimizer_adafactor_clipping_threshold": 1.0,
"optimizer_adafactor_decay_type": "pow",
"optimizer_adafactor_factored": true,
"optimizer_adafactor_memory_exponent": 0.8,
"optimizer_adafactor_multiply_by_parameter_scale": true,
"optimizer_adam_beta1": 0.9,
"optimizer_adam_beta2": 0.997,
"optimizer_adam_epsilon": 1e-09,
"optimizer_momentum_momentum": 0.9,
"optimizer_momentum_nesterov": false,
"optimizer_multistep_accumulate_steps": null,

bucket



decode - length penalty - alpha

entropy loss - used to avoid repetitive decoding

Why does repetitive decoding happen? Because the length penalty favors longer sentences.
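
For reference, a sketch of the widely used GNMT-style penalty that the alpha knob controls (illustrative, not a verbatim copy of the t2t beam-search code): the beam score divides the total log-probability by ((5 + len) / 6) ** alpha, so a larger alpha discounts length more and favors longer hypotheses:

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def beam_score(log_prob_sum, length, alpha=0.6):
    """Beam-search score: length-normalized log probability.
    With alpha > 0, longer hypotheses are penalized less, which is
    what can push decoding toward long, repetitive outputs."""
    return log_prob_sum / length_penalty(length, alpha)
```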

Complexity can also be used as a penalty, computed over both the input sentence and the output sentence.

It can also be understood as ...